Претрага
76 items
-
From ELTeC Text Collection Metadata and Named Entities to Linked-data (and Back)
In this paper we present the wikification of the ELTeC (European Literary Text Collection), developed within the COST Action ``Distant Reading for European Literary History'' (CA16204). ELTeC is a multilingual corpus of novels written in the time period 1840—1920, built to apply distant reading methods and tools to explore the European literary history. We present the pipeline that led to the production of the linked dataset, the novels’ metadata retrieval and named entity recognition, transformation, mapping and Wikidata population, ...Milica Ikonić Nešić, Ranka Stanković, Christof Schöch and Mihailo Škorić. "From ELTeC Text Collection Metadata and Named Entities to Linked-data (and Back)" in Proceedings of The 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference, June 2022, Marseille, France, European Language Resources Association (2022)
-
SrpELTeC on Platforms: Udaljeno čitanje, Aurora, NoSketch
Serbian ELTeC collection (100 novels and extended) developed within COST action CA16204 Distant Reading for European Literary History comprises at this moment 111 novels published in the period 1840-1920. Such a valuable resource is and will be used for various lexical and linguistic research, by using different tools and methodologies. In this paper, three platforms on which these novels are published will be presented: “Udaljeno ˇcitanje”, Aurora and Sketch Engine.Ranka Stanković, Mihailo Škorić, Petar Popović. "SrpELTeC on Platforms: Udaljeno čitanje, Aurora, NoSketch" in Infotheca, Faculty of Philology, University of Belgrade (2022). https://doi.org/10.18485/infotheca.2021.21.2.7
-
Annotation of the Serbian ELTeC Collection
Ovaj rad predstavlja takozvano izdanje nivoa 2 kolekcije tekstova SrpELTeC razvijene u okviru aktivnosti Radne grupe 2 – Metode i alati COST akcije CA 16204 (Distant Reading for European Literary History) i njene specifikacije šeme. Izdanje nivoa 2 je nastavak izdanja nivoa 1, koje se koristi kao ulaz za morfosintaksičke i NER anotacije romana. Srpska obrada nivoa-2 je navedena kroz potrebne korake, uključujući metode i alate koji se koriste u tom procesu. Neki statistički podaci iz srpske kolekcije nivoa ...udaljeno čitanje, literarni korpus, tagiranje, prepoznavanje imenovanih entiteta, lematizacija, ELTeCRanka Stanković, Cvetana Krstev, Branislava Šandrih Todorović, Mihailo Škorić. "Annotation of the Serbian ELTeC Collection" in Infotheca, Faculty of Philology, University of Belgrade (2021). https://doi.org/10.18485/infotheca.2021.21.2.3
-
Serbian ELTeC Sub-Collection in Wikidata
This paper presents an example of integration of Wikidata with digital libraries and external systems, as well as some best practices for speeding up the process of data preparation and import to Wikidata, on the use case of SrpELTeC, Serbian subcollection of the ELTeC multilingual collection (European Literary Text Collection). After preliminary work on the manual Wikidata population with SrpELTeC novels, the goal was to automate the process of preparing and importing information, so different solutions were analysed and ...Milica Ikonić Nešić, Ranka Stanković, Biljana Rujević. "Serbian ELTeC Sub-Collection in Wikidata" in Infotheca, Faculty of Philology, University of Belgrade (2021). https://doi.org/10.18485/infotheca.2021.21.2.4
-
Нове технологије за оживљавање старих текстова
удаљено читање, књижевни корпус, обрада српског језика, анотација врстом речи, лематизација, именовани ентитетиЦветана Крстев, Ранка Станковић, Бранислава Шандрих Тодоровић, Милица Иконић Нешић. "Нове технологије за оживљавање старих текстова" in Зборник радова Међународне научне конференције Дигитална хуманистика и словенско културно наслеђе II, Београд, 28-29 јуни 2021., Београд : Савез славистичких друштава Србије (2023)
-
An Italian-Serbian Sentence Aligned Parallel Literary Corpus
This article presents the construction and relevance of an Italian-Serbian sentence-aligned parallel corpus, delving into the aligned sentences in order to facilitate effective translation between the two languages. The parallel corpus serves as a valuable resource for language experts, researchers, and language enthusiasts, fostering a deeper understanding of linguistic nuances and cultural expressions. By bridging the gap between Serbian and Italian, this corpus opens new avenues for cross-cultural communication and collaboration, and ultimately contributes to the improvement of language-related ...Saša Moderc, Ranka Stanković, Aleksandra Tomašević, Mihailo Škorić. "An Italian-Serbian Sentence Aligned Parallel Literary Corpus" in Review of the National Center for Digitization, Belgrade : Faculty of Mathematics, University of Belgrade (2023). https://doi.org/10.5281/zenodo.11203388
-
A Mathematical Learning Environment Based on Serbian Language Resources
In recent years, in line with ever growing usage of Information technology, the learning environments are changing. The amount of available learning materials in various forms has increased. These new environments demand comprehensive learning systems, which enable management of the learning corpus with special attention paid to relevant lexical resources. In this paper we present the concept of a Mathematical Learning Environment in Serbian (MLES), which is based on a corpus of mathematical materials and various lexical resources, enabling ...... is dedicated to corpus processing and alignment with existing lexical resources. (Figure 1). Figure 1. Structure of corpus processing The obtained results are processed text, augmented dictionaries and annotated content. In this component a special challenge to corpus processing results ...
... between tokens). Through CQP web users can both specify query patterns and get statistical information about corpus. Within preparation of the MLES corpus to each word within the corpus the following information is assigned, in the following order: Word type (noun, verb, adjective, etc.) - ...
... annotation. The advantages of corpus annotation are the following: Corpus search becomes more efficient due to the possibility of formulating more precise queries. When search results are concerned, annotation. Compensates for the information lost during corpus preparation (removed figures ...Radojičić Marija, Obradović Ivan, Stanković Ranka, Utvić Miloć, Kaplar Sebastijan. "A Mathematical Learning Environment Based on Serbian Language Resources" in Proceedings of the 7th International Scientific Conference Technics and Informatics in Education, Faculty of Technical Sciences, Čačak (2018)
-
Vebran Web Services for Corpus Query Expansion
Ranka Stanković, Miloš Utvić (2020)U ovom radu se govori o razvoju veb usluga Vebran i njihovoj primeni u poboljšanju pretraživanja korpusa. Veb-servisi Vebran koriste se za konsultovanje spoljnih leksičkih izvora za srpski jezik (uglavnom elektronski morfološki rečnici i srpski Vordnet) i proširivanje korisničkih upita radi dobijanja relevantnijih rezultata iz srpskih korpusa.... annotated. SrpKor2013 is the current version of SrpKor (Utvić, 2014), used as a refer- ence and general purpose corpus containing over 122 million corpus words. It includes literary texts of Serbian writers in the XX and XXI centuries, as well as scientific and popular science texts from different ...
... the internet por- tal “Peščanik”. Some of the texts are translations, most of which are literary texts, while a smaller part are translations of general texts. Apart from being morphologically annotated, the corpus texts are provided with correspond- ing bibliographic description, information concerning ...
... with documents on finance, health, law and education; – SrpFranKor4, aligned French-Serbian corpus; – SrpNemKor5, aligned German-Serbian corpus; – RudKor6, a specialized monolingual corpus of texts from the mining domain, etc. The query expansion will be demonstrated using examples in two Ser- bian ...Ranka Stanković, Miloš Utvić. "Vebran Web Services for Corpus Query Expansion" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.5
-
Towards translation of educational resources using GIZA++
... ripts/recaser/train-truecaser.perl \ --model ~/corpus/truecase-model.en --corpus \ ~/corpus/edX.tok.en ~/mosesdecoder/scripts/recaser/train-truecaser.perl \ --model ~/corpus/truecase-model.sr --corpus \ ~/corpus/edX.tok.sr Finally cleaning and limiting the length ...
... odel.sr \ < ~/corpus/edX.tok.sr \ > ~/corpus/edX.true.sr ~/mosesdecoder/scripts/training/clean-corpus-n.perl \ ~/corpus/edX.true sr en \ ~/corpus/edX.clean 1 80 Language Model Training A language model (LM) is used to ensure fluent output, built with the target language ...
... ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \ < ~/corpus/training/edX.en \ > ~/corpus/edX.tok.en ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l sr \ < ~/corpus/training/edX.sr \ > ~/corpus/edX.tok.sr The truecaser first requires training, in order to extract ...Ivan Obradović, Dalibor Vorkapić, Ranka Stanković, Nikola Vulović, Miladin Kotorčević. "Towards translation of educational resources using GIZA++" in The Seventh International Conference on e-Learning (eLearning-2016), September 2016, Belgrade : Metropolitan Univesity (2016)
-
Managing mining project documentation using human language technology
Purpose: This paper aims to develop a system, which would enable efficient management and exploitation of documentation in electronic form, related to mining projects, with information retrieval and information extraction (IE) features, using various language resources and natural language processing. Design/methodology/approach: The system is designed to integrate textual, lexical, semantic and terminological resources, enabling advanced document search and extraction of information. These resources are integrated with a set of Web services and applications, for different user profiles and use-cases. Findings: The ...Digital libraries, Information retrieval, Data mining, Human language technologies, Project documentationAleksandra Tomašević, Ranka Stanković, Miloš Utvić, Ivan Obradović, Božo Kolonja . "Managing mining project documentation using human language technology" in The Electronic Library (2018). https://doi.org/10.1108/EL-11-2017-0239
-
Football terminology: compilation and transformation into OntoLex-Lemon resource
У овом раду представља се пројекат који је у развоју, креирање првог дигиталног фудбалског речника на српском језику, као и да демонстрација примене модела OntoLex и љегових модула. OntoLex-FrAC модул укључује информације о учесталости и примерима употребе екстрахованих из корпуса. У овом случају, креиран је корпус за специфичан домен под називом СрФудКо, који садржи чланке вести о фудбалу на српском језику. Вишечлани термини аутоматски су екстраховани из српског корпуса, а затим ручно евалуирани и класификовани као спортски или ...Jelena Lazarević, Ranka Stanković, Mihailo Škorić, Biljana Rujević. "Football terminology: compilation and transformation into OntoLex-Lemon resource" in LDK 2023 – 4th Conference on Language, Data and Knowledge, 12-15 September in Vienna, Austria, Lisabon : NOVA FCSH - CLUNL (2023). https://doi.org/10.34619/srmk-injj
-
A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian
Uvredljivi govor na društvenim medijima, uključujući psovke, pogrdni govor i govor mržnje, dostigao je nivo pandemije. Sistem koji bi bio u stanju da detektuje takve tekstove mogao bi da pomogne da internet i društveni mediji postanu bolji virtuelni prostor sa više poštovanja. Istraživanja i komercijalna primena u ovoj oblasti do sada su bili fokusirani uglavnom na engleski jezik. Ovaj rad predstavlja rad na izgradnji AbCoSER-a, prvog korpusa uvredljivog govora na srpskom jeziku. Korpus se sastoji od 6.436 ručno označenih ...... should we take when building the corpus of abusive language, several future implications were considered: 1) To the best of our knowledge, AbCoSER (Abusive Corpus for serbian) is the first corpus tackling abusive language phenomenon in the Serbian language; 2) This corpus is to be used to enrich our lexicon ...
... n of tweets length. Figure 5 Tree clouds of abusive language corpus: the non-abusive subset (left) and abusive subset (right). 3 A Corpus and Lexicon for Abusive Speech 3.1 Analysis of the twitter corpus The distribution of our corpus tweets length after removal of mentions is shown in Figure 4: ...
... from the abusive tweet corpus) and for neutral language (derived from the corpus of non-abusive tweets), which enables calculation of the so-called keyness score, which should represent the extent of the frequency difference. These frequencies can also be compared with the corpus of standard Serbian (as ...Danka Jokić, Ranka Stanković, Cvetana Krstev, Branislava Šandrih. "A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian" in 3rd Conference on Language, Data and Knowledge (LDK 2021), MDPI AG (2021). https://doi.org/10.4230/OASIcs.LDK.2021.13
-
Bilingual lexical extraction based on word alignment for improving corpus search
Jelena Andonovski, Branislava Šandrih, Olivera Kitanović. "Bilingual lexical extraction based on word alignment for improving corpus search" in The Electronic Library, Emerald (2019). https://doi.org/10.1108/EL-03-2019-0056
-
Towards Automatic Definition Extraction for Serbian
U radu su prikazani preliminarni rezultati automatske ekstrakcije kandidata za definicije rečnika iz nestrukturiranih tekstova na srpskom jeziku u cilju ubrzanja razvoja rečnika. Definicije u rečniku Srpske akademije nauka i umetnosti (SANU) korišćene su za modelovanje različitih tipova definicija (opisnih, gramatičkih, referentnih i sinonimskih) koje imaju različite sintaksičke i leksičke karakteristike. Korpus istraživanja sastoji se od 61.213 definicija imenica, koje su analizirane korišćenjem morfoloških e-rečnika i lokalnih gramatika implementiranih kao pretvarači konačnih stanja u paketu za obradu korpusa otvorenog ...... related to extraction of dictionary examples has shown that information extraction from a corpus can be used to speed up the work on the Serbian Academy of Sciences and Arts Dictionary of Serbo-Croatian Literary and Folk Language (SASA Dictionary) and other dictionaries (Stanković et al. 2019; Kitanović ...
... enabling the connection are given in italic. 4 Corpus Analysis 4.1 Creating a Textbook Corpus For testing the automatic extraction of definitions from unstructured text by means of local grammars presented in the previous section, we created a corpus of 25 primary and secondary school textbook covering ...
... 3 4.2 Recognition of Candidates in the Textbook Corpus To determine whether it is possible to recognize definitions of domain-specific terms in the domain corpus text, a subset of local grammars presented in Section 3 was applied to the corpus consisting of 25 textbooks; namely we excluded models ...Ranka Stanković, Cvetana Krstev, Rada Stijović, Mirjana Gočanin, Mihailo Škorić. "Towards Automatic Definition Extraction for Serbian" in Proceedings of the XIX EURALEX Congress of the European Assocition for Lexicography: Lexicography for Inclusion (Volume 2). 7-9 September (virtual), Democritus University of Thrace (2021)
-
Two approaches to compilation of bilingual multi-word terminology lists from lexical resources
In this paper, we present two approaches and the implemented system for bilingual terminology extraction that rely on an aligned bilingual domain corpus, a terminology extractor for a target language, and a tool for chunk alignment. The two approaches differ in the way terminology for the source language is obtained: the first relies on an existing domain terminology lexicon, while the second one uses a term extraction tool. For both approaches, four experiments were performed with two parameters being ...Branislava Šandrih, Cvetana Krstev, Ranka Stanković. "Two approaches to compilation of bilingual multi-word terminology lists from lexical resources" in Natural Language Engineering, Cambridge University Press (CUP) (2020). https://doi.org/10.1017/S1351324919000615
-
Rule-based Automatic Multi-word Term Extraction and Lemmatization
In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is unavoidable for highly inflected languages in order to pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from ...... within the domain corpus, whereas the remaining two measures compare term frequency in the domain corpus and the general language corpus, thus illustrating how specific the MWU is for the selected domain. As the general corpus we used a 22 million words excerpt from the Corpus of Contemporary ...
... For evaluation we used a corpus that contains 10textbooks, 2 projects and 51 journal articles from the mining domain. The size of this corpus is 32,633 sentences and 625,105 simple word forms. For calculation of measures that compare results on a domain corpus with general language we used ...
... Language Using Rules Automatically Generated Based on the Corpus Analysis. In Proc. of 7th Language & Technology Conference 2015, Poznań: Fundacja Uniwersytetuim. A. Mickiewicza, pp. 540--544. Pantel, P., Dekang L. (2001). A statistical corpus-based term extractor. In Stroulia, E., Matwin, S. (Eds ...Ranka Stanković, Cvetana Krstev, Ivan Obradović, Biljana Lazić, Aleksandra Trtovac. "Rule-based Automatic Multi-word Term Extraction and Lemmatization" in Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, 23--28 May 2016, European Language Resources Association (2016)
-
Corpus-based bilingual terminology extraction in the power engineering domain
Ovaj rad predstavlja resurse i alate koji se koriste za ekstrkciju i evaluaciju dvojezične, englesko-srpske terminologije u domenu energetike. Resursi se sastoje od postojeće opšte i domenske leksike i domenskog paralelnog korpusa; alati uključuju ekstraktore termina za oba jezika i alat za poravnavanje segmenata koji pripadaju korpusnim rečenicama. Sistem je testiran variranjem funkcije podudaranja koja utvrđuje prisustvo ekstrahovanog termina u poravnatom segmentu (odsečak), u rasponu od veoma labavog do strogog. Procena rezultata je pokazala da je preciznost izdvajanja termina ...Tanja Ivanović, Ranka Stanković, Branislava Šandrih Todorović, Cvetana Krstev. "Corpus-based bilingual terminology extraction in the power engineering domain" in Terminology, John Benjamins Publishing Company (2022). https://doi.org/10.1075/term.20038.iva
-
A Data Driven Approach for Raw Material Terminology
Olivera Kitanović, Ranka Stanković, Aleksandra Tomašević, Mihailo Škorić, Ivan Babić, Ljiljana Kolonja (2021)The research presented in this paper aims at creating a bilingual (sr-en), easily searchable, hypertext, born-digital, corpus-based terminological database of raw material terminology for dictionary production. The approach is based on linking dictionaries related to the raw material domain, both digitally born and printed, into a lexicon structure, aligning terminology from different dictionaries as much as possible. This paper presents the main features of this approach, data used for compilation of the terminological database, the procedure by which it has ...sirovine, rudarstvo, terminologija, rečnik, terminološka aplikacija, mobilna aplikacija, digitizacija, leksički podaci, korpusi, otvoreni povezani podaci... a lexeme are grouped together under a base dictionary form. The Serbian corpus and the Serbian part of the bilingual corpus are tagged and lemmatized using a customised tagger [35], while the English part of the bilingual corpus is tagged by Treetagger [36,37]. Texts included in corpora are also processed ...
... between 3%–5%, but whether they would be included into SrpMD depended on their frequency in the mining corpus. Besides the digitized dictionaries, the Serbian corpus and the Serbian part of the bilingual corpus from the mining domain were yet another source of new raw material domain terms that did not exist ...
... several terms in paper dictionaries are not in use anymore. That observation initiated frequency calculation of Serbian terms in the mining corpus. Frequency in the corpus and the number of dictionaries that attest a term were the main criteria for post editing priority of the term. Entries from all digitized ...Olivera Kitanović, Ranka Stanković, Aleksandra Tomašević, Mihailo Škorić, Ivan Babić, Ljiljana Kolonja. "A Data Driven Approach for Raw Material Terminology" in Applied Sciences, MDPI AG (2021). https://doi.org/10.3390/app11072892
-
FrameNet Lexical Database: Presenting a Few Frames Within the Risk Domain
U radu se daje kratak prikaz teorije semantike okvira, na kojoj je zasnovana leksička baza Frejmnet. Predstavljena je koncepcija ove mreže, kao i mogućnosti njene primene. Predstavljena je i leksička analiza koja se primenjuje u projektu izrade Frejmneta i ukazano na razlike između analize zasnovane na okviru u odnosu na analizu zasnovanu na reči. Zatim je prikazano nekoliko povezanih okvira koje prizivaju reči iz domena rizika. U radu je predstavljena i platforma NLTК pomoću koje se mogu koristiti ...... (Natural Language Toolkit) suite, which provides a good natural language pro- cessing resource. The last chapter shows a corpus search of the noun risk in a mining- themed corpus. We also present its most common collocates, word sketch, individual pattern concordances, thesaurus entry of its synonyms ...
... analysis of the meaning of an LU, its lexical surroundings, phrases and grammatical constructions in which it appears in the corpus, the context in which it is used provided by corpus examples, as well as all the phrases in which the LU fulfills its full semantic potential. This approach consists of listing ...
... 2016, 21) and its use is looked into by extracting sentences, which contain it, from the corpus. A lexicographer working in FrameNet compares his or her insight into the meaning of a target lexeme, based on corpus examples, to the meaning given in descriptive dictionaries.8 Once he gets a clearer idea ...Aleksandra Marković, Ranka Stanković, Natalija Tomić, Olivera Kitanović. "FrameNet Lexical Database: Presenting a Few Frames Within the Risk Domain" in Infotheca, Faculty of Philology, University of Belgrade (2021). https://doi.org/10.18485/infotheca.2021.21.1.1
-
Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection
Ranka Stanković, Cvetana Krstev, Branislava Šandrih Todorović, Duško Vitas, Mihailo Škorić, Milica Ikonić Nešić (2022)In this paper we present the Serbian part of the ELTeC multilingual corpus of novels written in the time period 1840-1920. The corpus is being built in order to test various distant reading methods and tools with the aim of re-thinking the European literary history. We present the various steps that led to the production of the Serbian sub-collection: the novel selection and retrieval, text preparation, structural annotation, POS-tagging, lemmatization and named entity recognition. The Serbian sub-collection was published ...Ranka Stanković, Cvetana Krstev, Branislava Šandrih Todorović, Duško Vitas, Mihailo Škorić, Milica Ikonić Nešić. "Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection" in Proceedings of the Language Resources and Evaluation Conference, June 2022, Marseille, France, European Language Resources Association (2022)